[test optimization] Add filesystem cache for test optimization API requests #7919
juan-fernandez merged 16 commits into master
In monorepo setups with thousands of parallel jest sessions (e.g. lage with 3000+ packages), every session independently fetches the same known tests from the API. With 200k+ tests and cursor-based pagination, this causes massive redundant network traffic and delays session startup. Add a filesystem cache in os.tmpdir() keyed on (sha, service, env, repositoryUrl, configurations). The first session acquires an exclusive lock (O_CREAT|O_EXCL), fetches from the API, and writes the cache atomically. Concurrent sessions poll for the cache file to appear. Cache entries expire after 30 minutes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
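The lock-then-write flow described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `tryAcquireLock` name and the temp-file naming are assumptions, and both paths are passed in by the caller.

```javascript
'use strict'

const fs = require('node:fs')
const os = require('node:os')
const path = require('node:path')

// Hypothetical helper. The 'wx' flag includes O_CREAT | O_EXCL, so the open
// fails with EEXIST when the lock file already exists and only one process wins.
function tryAcquireLock (lockPath) {
  try {
    const fd = fs.openSync(lockPath, 'wx')
    fs.writeSync(fd, String(Date.now()))
    fs.closeSync(fd)
    return true
  } catch (err) {
    if (err.code === 'EEXIST') return false
    throw err
  }
}

// Write to a temp file in the same directory, then rename it into place, so
// readers see either no cache file or a complete one.
function writeToCache (cachePath, data) {
  const tmpPath = `${cachePath}.${process.pid}.tmp`
  fs.writeFileSync(tmpPath, JSON.stringify({ timestamp: Date.now(), data }))
  fs.renameSync(tmpPath, cachePath)
}
```

A process that gets `true` back would fetch from the API and call `writeToCache`; a process that gets `false` would poll for the cache file instead.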
Integrate both the filesystem cache (this branch) and cursor-based pagination (from master #7866) into fetchFromApi. Also fix a bug where writeToCache referenced the old `knownTests` variable instead of `aggregateTests`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Overall package size
Self size: 5.46 MB
Dependency sizes
| name | version | self size | total size |
|------|---------|-----------|------------|
| import-in-the-middle | 3.0.0 | 81.15 kB | 815.98 kB |
| dc-polyfill | 0.1.10 | 26.73 kB | 26.73 kB |
🤖 This report was automatically generated by heaviest-objects-in-the-universe
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #7919 +/- ##
==========================================
+ Coverage 74.26% 74.43% +0.17%
==========================================
Files 765 766 +1
Lines 35786 35906 +120
==========================================
+ Hits 26575 26727 +152
+ Misses 9211 9179 -32
✅ Tests: 🎉 All green! ❄️ No new flaky tests detected
🔗 Commit SHA: f2c09cc
- waitForCache now removes the stale lock file before falling back to a direct fetch, so subsequent processes can re-use the deduplication path
- 500-response tests now provide two nock replies to account for the request module's built-in 5xx retry
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fetches The lock owner now touches the lock file every 30s so waiters can distinguish a slow-but-healthy pagination (e.g. 200k tests over many pages with 20s timeouts + retries) from a crashed owner. Without this, a fetch exceeding 2 minutes would be misclassified as stale, causing waiters to break the lock and fetch concurrently — defeating the deduplication on exactly the large payloads it exists to protect. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
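A heartbeat of this kind could be wired up along these lines. This is a sketch under assumptions: `startHeartbeat` and the injected `touch` callback are hypothetical names, and only the 30-second interval comes from the commit message.

```javascript
'use strict'

// Hypothetical helper: call `touch` every `intervalMs` until the returned
// stop function is invoked. `touch` is whatever refreshes the lock timestamp.
function startHeartbeat (touch, intervalMs = 30 * 1000) {
  const timer = setInterval(() => {
    try {
      touch()
    } catch {
      // A failed heartbeat should not crash the fetch. If it keeps failing,
      // waiters will eventually classify the lock as stale.
    }
  }, intervalMs)
  // Do not keep the process alive only for the heartbeat.
  timer.unref()
  return () => clearInterval(timer)
}
```

The lock owner would call the returned function once the cache is written or the fetch fails, so the timer does not outlive the lock.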
Integration tests reuse the same mock server port within a file with different known tests data per test case. The cache would return stale data from a previous test, causing timeouts. Add DD_CIVISIBILITY_KNOWN_TESTS_CACHE_DISABLED env var (default false) and set it in getCiVisAgentlessConfig/getCiVisEvpProxyConfig helpers. When set, getKnownTests bypasses cache entirely and fetches directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…KNOWN_TESTS_CACHE_ENABLED Rename env var and flip the default: cache is now off unless explicitly enabled. This avoids interference with integration tests (no env var needed in test helpers) while letting monorepo users opt in for the deduplication benefit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benchmarks
Benchmark execution time: 2026-04-06 15:05:16
Comparing candidate commit f2c09cc in PR branch
Found 0 performance improvements and 0 performance regressions! Performance is the same for 234 metrics, 26 unstable metrics.
…uests Extract cache infrastructure into packages/dd-trace/src/ci-visibility/requests/fs-cache.js with a reusable withCache() wrapper. Apply caching to getKnownTests, getSkippableSuites, and getTestManagementTests. All three are behind a single opt-in flag: DD_EXPERIMENTAL_TEST_REQUESTS_FS_CACHE (default false). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cal position Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The env var is a string — '!!value' treats 'false' and '0' as truthy. Use isTrue() which correctly handles 'true'/'1' vs 'false'/'0'. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
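To make the pitfall concrete, here is a small sketch. The `isTrue` below is a hypothetical stand-in written for illustration; the real helper's exact behavior is not shown in this thread.

```javascript
'use strict'

// Any non-empty string is truthy, so double negation cannot tell the string
// 'false' apart from the string 'true'.
const naive = (value) => !!value

// Hypothetical stand-in for isTrue(): only 'true' and '1' enable the flag.
function isTrue (value) {
  if (typeof value !== 'string') return value === true
  const normalized = value.trim().toLowerCase()
  return normalized === 'true' || normalized === '1'
}
```

With these definitions, `naive('false')` evaluates to `true`, while `isTrue('false')` and `isTrue('0')` both evaluate to `false`.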
readFromCache now rejects entries where the data field is undefined or
null. This prevents stale cache files written with an older format
(e.g. { timestamp, knownTests } instead of { timestamp, data }) from
being treated as valid cache hits that return undefined data.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
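A validation step like the one described might look as follows. This is a sketch, not the PR's code: the function body and the `CACHE_TTL_MS` name are assumptions, while the `{ timestamp, data }` shape and the 30-minute expiry come from this thread.

```javascript
'use strict'

const fs = require('node:fs')

const CACHE_TTL_MS = 30 * 60 * 1000 // 30-minute expiry

// Returns the cached data, or undefined on any kind of miss.
function readFromCache (cachePath, now = Date.now()) {
  let entry
  try {
    entry = JSON.parse(fs.readFileSync(cachePath, 'utf8'))
  } catch {
    return undefined // missing or unparsable file
  }
  if (entry === null || typeof entry !== 'object') return undefined
  if (typeof entry.timestamp !== 'number' || now - entry.timestamp > CACHE_TTL_MS) return undefined
  // `== null` matches both undefined and null, so an entry written in the
  // older { timestamp, knownTests } shape is rejected here.
  if (entry.data == null) return undefined
  return entry.data
}
```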
parts.join('|') is collision-prone: fields containing '|' or undefined
values collapsing with '' can produce identical hashes for different
inputs. JSON.stringify(parts) preserves array structure and
distinguishes undefined from '' and objects from their string form.
Remove redundant JSON.stringify(custom) at call sites since the
top-level JSON.stringify handles it.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
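The collision is easy to reproduce in isolation. In the sketch below, `joinKey` and `jsonKey` are hypothetical names and SHA-256 is an assumed hash choice; only the `join('|')` versus `JSON.stringify` contrast is taken from the commit message.

```javascript
'use strict'

const crypto = require('node:crypto')

const hash = (input) => crypto.createHash('sha256').update(input).digest('hex')

// Old approach: flatten the parts with a separator that may occur in the values.
const joinKey = (parts) => hash(parts.join('|'))

// New approach: serialize the array so element boundaries are preserved.
const jsonKey = (parts) => hash(JSON.stringify(parts))

// Different inputs that both flatten to 'a|b|c' when joined.
const first = ['a|b', 'c']
const second = ['a', 'b|c']

// undefined and '' both flatten to an empty segment when joined.
const withUndefined = [undefined, 'x']
const withEmpty = ['', 'x']
```

Here `joinKey(first)` equals `joinKey(second)`, while the `jsonKey` values differ. One caveat: `JSON.stringify` serializes `undefined` array elements as `null`, so `undefined` and `null` still map to the same key.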
Exercise the cached-return path for getSkippableSuites (including correlationId unwrap) and getTestManagementTests. Each test file verifies: fetch + callback shape, cache hit on second call, and lock cleanup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eness The fixed 2-minute deadline caused waiters to fall back to direct fetch even when the lock owner was still alive (heartbeat fresh). This defeated deduplication for slow paginated fetches that exceed 2 minutes. Now waiters only fall back when isLockStale() returns true (lock file timestamp older than 2 minutes without heartbeat update), which correctly distinguishes a crashed owner from a slow healthy one. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
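One polling step of a waiter, driven by lock staleness rather than a fixed deadline, might be sketched like this. `nextWaiterAction` and its return values are hypothetical; `isLockStale` and the 2-minute threshold are named in the commit message, but the body here is an assumption.

```javascript
'use strict'

const fs = require('node:fs')

const LOCK_STALE_MS = 2 * 60 * 1000

// The lock file is assumed to contain the time of the last heartbeat in ms.
function isLockStale (lockPath, now = Date.now()) {
  let written
  try {
    written = Number(fs.readFileSync(lockPath, 'utf8'))
  } catch {
    return true // lock is gone: the owner finished or crashed
  }
  return !Number.isFinite(written) || now - written > LOCK_STALE_MS
}

// Hypothetical decision helper for a single poll iteration.
function nextWaiterAction (cachePath, lockPath, now = Date.now()) {
  if (fs.existsSync(cachePath)) return 'read-cache'
  if (isLockStale(lockPath, now)) return 'break-lock-and-fetch'
  return 'keep-waiting'
}
```

Because the decision depends only on the heartbeat timestamp, a waiter can poll for as long as the owner keeps refreshing the lock.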
writeFileSync truncates the file before writing, creating a brief
window where the lock file is empty. Waiters polling at that moment
read Number('') = 0, compute Date.now() - 0 > 120000 = true, and
misclassify the lock as stale — breaking deduplication.
Use temp file + rename (same pattern as writeToCache) so readers
always see either the old timestamp or the new one, never empty.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
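The failure mode and the fix can be sketched as below. `touchLock` and the temp-file naming are assumptions for illustration; the sketch also assumes the temp file and the lock live in the same directory, so the rename stays on one filesystem.

```javascript
'use strict'

const fs = require('node:fs')

// The misclassification described above: an empty read parses as 0 (the
// epoch), which is always more than two minutes in the past.
const emptyReadLooksStale = Date.now() - Number('') > 120000

// Hypothetical heartbeat write: stage the new timestamp in a temp file and
// rename it over the lock, so a reader never observes a truncated file.
function touchLock (lockPath, now = Date.now()) {
  const tmpPath = `${lockPath}.${process.pid}.tmp`
  fs.writeFileSync(tmpPath, String(now))
  fs.renameSync(tmpPath, lockPath)
}
```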
- mergeKnownTests is only used internally, no need to export
- integration test helpers don't need changes since cache is opt-in
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
What does this PR do?
Adds an opt-in filesystem cache (`os.tmpdir()`) for three test optimization API endpoints: known tests, skippable suites, and test management tests. When enabled via `DD_EXPERIMENTAL_TEST_REQUESTS_FS_CACHE`, the first process to request data acquires an exclusive lock, fetches from the API, and writes the result to a shared cache file. Concurrent processes wait for the cache to appear instead of making redundant requests. Cache entries expire after 30 minutes.
A shared `fs-cache.js` module provides a reusable `withCache()` wrapper with:
- `O_CREAT|O_EXCL` lock files
- atomic writes (temp file + `rename`) to prevent partial reads on both cache and lock files
Motivation
In monorepo setups using tools like `lage` with potentially thousands of parallel jest sessions, every session independently fetches the same data from the API.